Delhi Housing Poster

🔍 Uncover the Hidden Patterns in Delhi's Real Estate Market¶

Powered by Machine Learning, Data Visualization & Dashboarding¶


✅ Built a Model using predictive features
📊 Created interactive Plotly visuals to explore rental dynamics
📌 Delivered sharp insights into locality-based pricing
📈 Integrated Power BI Dashboard for a real-time user experience


“This is not just prediction — it's storytelling with data.” 💡

📦 Dataset Overview¶

The dataset consists of rich quantitative, categorical, and geospatial features that influence housing rental prices in Delhi.


🏠 House Features¶

  • size_sq_ft – Area of the house in square feet
  • propertyType – Type of property (e.g., Apartment, Villa, Studio)
  • bedrooms – Number of bedrooms

📍 Location Features¶

  • latitude, longitude – Geographic coordinates of the house
  • localityName – Specific locality within the city
  • suburbName – Suburban classification of the region
  • cityName – City name (Delhi)

💰 Rental Information¶

  • price – Monthly asking rent for the property

🏢 Agency Details¶

  • companyName – Real estate agency or listing company

🗺️ Proximity to Key Landmarks (Geodesic distance only)¶

  • closest_metro_station_km – Distance to the nearest Metro Station
  • AP_dist_km – Distance to Indira Gandhi International Airport
  • Aiims_dist_km – Distance to AIIMS Delhi (a major government hospital)
  • NDRLW_dist_km – Distance to New Delhi Railway Station

📊 This diverse feature set enables powerful rental price predictions based on locality, amenities, size, and landmark access.

In [191]:
## Libraries
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split, StratifiedShuffleSplit
from sklearn.preprocessing import LabelEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score 
from sklearn.model_selection import RandomizedSearchCV

import seaborn as sns
import matplotlib.pyplot as plt

import plotly.express as px
import plotly.io as pio
import plotly.offline as pyo
pio.renderers.default = "iframe"
pio.templates.default = "plotly_dark"  
pio.renderers.default = 'notebook_connected'

📁 1. Data Loading & Exploration¶

In [115]:
house_rent = pd.read_csv("Project_data.csv")
In [116]:
house_rent.head()
Out[116]:
Column1 size_sq_ft propertyType bedrooms latitude longitude localityName suburbName cityName price companyName closest_mtero_station_km AP_dist_km Aiims_dist_km NDRLW_dist_km
0 0 400 Independent Floor 1 28.641010 77.284386 Swasthya Vihar Delhi East Delhi 9000 Dream Homez 0.577495 21.741188 11.119239 6.227231
1 1 1050 Apartment 2 28.594969 77.298668 mayur vihar phase 1 Delhi East Delhi 20000 Rupak Properties Stock 0.417142 21.401856 9.419061 9.217502
2 2 2250 Independent Floor 2 28.641806 77.293922 Swasthya Vihar Delhi East Delhi 28000 Aashiyana Real Estate 0.125136 22.620365 11.829486 7.159184
3 3 1350 Independent Floor 2 28.644363 77.293228 Krishna Nagar Delhi East Delhi 28000 Shivam Real Estate 0.371709 22.681201 11.982708 7.097348
4 4 450 Apartment 2 28.594736 77.311150 New Ashok Nagar Delhi East Delhi 12500 Shree Properties 1.087760 22.592810 10.571573 10.263271
In [117]:
house_rent.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17890 entries, 0 to 17889
Data columns (total 15 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Column1                   17890 non-null  int64  
 1   size_sq_ft                17890 non-null  int64  
 2   propertyType              17890 non-null  object 
 3   bedrooms                  17890 non-null  int64  
 4   latitude                  17890 non-null  float64
 5   longitude                 17890 non-null  float64
 6   localityName              17890 non-null  object 
 7   suburbName                17890 non-null  object 
 8   cityName                  17890 non-null  object 
 9   price                     17890 non-null  int64  
 10  companyName               17890 non-null  object 
 11  closest_mtero_station_km  17890 non-null  float64
 12  AP_dist_km                17890 non-null  float64
 13  Aiims_dist_km             17890 non-null  float64
 14  NDRLW_dist_km             17890 non-null  float64
dtypes: float64(6), int64(4), object(5)
memory usage: 2.0+ MB

NO null values present in any feautre

In [118]:
house_rent.describe()
Out[118]:
Column1 size_sq_ft bedrooms latitude longitude price closest_mtero_station_km AP_dist_km Aiims_dist_km NDRLW_dist_km
count 17890.000000 17890.000000 17890.000000 17890.000000 17890.000000 1.789000e+04 17890.000000 17890.000000 17890.000000 17890.000000
mean 8944.500000 1176.342091 2.168865 28.609382 77.168368 3.345196e+04 0.931495 13.727784 11.238134 11.421994
std 5164.542493 873.751044 0.971414 0.099547 0.097611 8.802054e+04 8.287856 11.357063 11.167202 11.063323
min 0.000000 100.000000 1.000000 19.185120 73.213829 1.200000e+03 0.000692 1.784779 0.634508 0.722023
25% 4472.250000 620.000000 1.000000 28.562540 77.103718 1.350000e+04 0.457782 11.018715 7.769267 7.986813
50% 8944.500000 900.000000 2.000000 28.611803 77.168755 2.200000e+04 0.698560 13.184035 10.515524 11.015571
75% 13416.750000 1600.000000 3.000000 28.651593 77.224998 3.500000e+04 1.087740 17.163502 15.514042 15.192483
max 17889.000000 16521.000000 15.000000 28.872597 80.358467 5.885646e+06 1096.479453 1109.894053 1115.621439 1123.778457

The size_sq_ft and price features contain significant outliers, as their maximum values (16,521 sq. ft and ₹5,885,646 respectively) are drastically higher than their means (1,176 sq. ft and ₹33,451), indicating strong right-skewed distributions.

In [119]:
df = house_rent.duplicated()
df.value_counts()
Out[119]:
False    17890
Name: count, dtype: int64

dataset also has no duplicate values

📊 3. Exploratory Data Analysis (EDA)¶

In [120]:
print(f"We have {house_rent['companyName'].unique().shape} company")
print(f"Our dataset based only {house_rent['cityName'].unique().shape} City")
We have (1387,) company
Our dataset based only (1,) City
In [121]:
print(f"We have {house_rent['localityName'].unique().shape}Locality")
We have (781,)Locality
In [122]:
house_rent['suburbName'].unique()
Out[122]:
array(['Delhi East', 'Rohini', 'Delhi South', 'West Delhi', 'North Delhi',
       'Dwarka', 'Delhi Central', 'Other', 'South West Delhi',
       'Delhi North', 'North West Delhi', 'Delhi West'], dtype=object)
In [123]:
print("This table shows the average asking price for each suburb area.")
house_rent.groupby('suburbName')['price'].mean().sort_values()
This table shows the average asking price for each suburb area.
Out[123]:
suburbName
South West Delhi    16848.697674
Delhi East          17650.752199
West Delhi          23735.962646
Rohini              23820.437956
North West Delhi    24254.545455
Dwarka              28285.025051
Delhi North         29045.454545
Delhi West          29620.156051
North Delhi         30469.256390
Other               35485.701774
Delhi Central       35606.693997
Delhi South         50311.178448
Name: price, dtype: float64

The dataset contains inconsistencies in suburb names, such as treating "Delhi North" and "North Delhi" (or "Delhi West" and "West Delhi") as separate entries, even though they refer to the same region. This may affect suburb-level analysis and requires data cleaning.

In [124]:
house_rent['suburbName'] = house_rent['suburbName'].replace('Delhi North','North Delhi')
house_rent['suburbName'] = house_rent['suburbName'].replace('Delhi West','West Delhi')
house_rent['suburbName'] = house_rent['suburbName'].replace('Rohini','North West Delhi')
house_rent['suburbName'] = house_rent['suburbName'].replace('Dwarka','South West Delhi')
house_rent['suburbName'] = house_rent['suburbName'].replace('Delhi South','South Delhi')
house_rent['suburbName'] = house_rent['suburbName'].replace('Delhi East','East Delhi')
In [125]:
other_suburbs = house_rent[house_rent['suburbName'].str.contains("^Other", na=False)]
In [126]:
# 1. Define mapping from localityName to suburbName
locality_to_suburb = {
    'laxmi nagar': 'East Delhi',
    'sultanpur': 'South Delhi',
    'chittaranjan park': 'South Delhi',
    'kirari suleman nagar': 'North West Delhi',
    'khirki extension': 'South Delhi',
    'khirki extension panchsheel vihar': 'South Delhi',
    'mansa ram park': 'South West Delhi',
    'govindpuri': 'South Delhi',
    'govindpuri Main': 'South Delhi',
    'pitampura': 'North West Delhi',
    'rajdhani enclave': 'North West Delhi',
    'vikaspuri':'West Delhi',
    'west end' : 'South Delhi',
    'new rajendra nagar':'Delhi Central'
}

# 2. Convert localityName to lowercase to ensure case-insensitive matching
house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()

# 3. Fill suburbName based on the mapping
house_rent['suburbName'] = house_rent.apply(
    lambda row: locality_to_suburb[row['localityName_clean']]
    if row['localityName_clean'] in locality_to_suburb else row['suburbName'],
    axis=1
)
In [127]:
# Filter rows where suburbName is 'Other Area' or contains 'Other'
other_mask = house_rent['suburbName'].str.lower().str.contains('other', na=False)

# Count how many times each locality appears in these rows
locality_freq_in_other = house_rent[other_mask]['localityName'].value_counts()

# Display result
print(locality_freq_in_other.head(20))
localityName
Poorvi Pitampura                                 71
Sector-7 Rohini                                  65
Sector-18 Dwarka                                 54
Prashant Vihar Sector 14                         41
Rohini Sector 9                                  40
Uttari Pitampura                                 32
Jangpura Extension                               30
Neeti Bagh                                       29
South Extension Part 1                           28
Sector 6 Rohini                                  24
DLF Phase 5                                      22
Dr Mukherjee Nagar West Bhai Parmanand Colony    17
Abul Fazal Enclave Jamia Nagar                   17
Jawahar Park                                     17
Khanpur Krishna Park                             17
Sant Nagar                                       16
Bharat Vihar                                     16
Sector 15 Rohini                                 16
Paschim Vihar A 1 Block                          16
Bank Enclave                                     16
Name: count, dtype: int64
In [128]:
locality_to_suburb = {
    'jawahar park': 'South Delhi',
    'khanpur krishna park': 'South Delhi',
    'dr mukherjee nagar west bhai parmanand colony': 'North Delhi',
    'jangpura extension': 'South Delhi',
    'abul fazal enclave jamia nagar': 'South Delhi',
    'sant nagar': 'South Delhi',
    'bharat vihar': 'West Delhi',
    'gtb nagar': 'North Delhi',
    'saidabad': 'East Delhi',
    'raju park': 'South Delhi',
    'jasola vihar sector 8 road': 'South Delhi',
    'chhattarpur enclave phase1': 'South Delhi',
    'mangal bazar road': 'Delhi Central',
    'jamia nagar': 'South Delhi',
    'aya nagar': 'South Delhi',
    'mayur vihar phase 2': 'East Delhi',
    'mayur vihar phase 3': 'East Delhi',
    'amar colony': 'South Delhi'
}

house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()

house_rent['suburbName'] = house_rent.apply(
    lambda row: locality_to_suburb[row['localityName_clean']]
    if row['localityName_clean'] in locality_to_suburb else row['suburbName'],
    axis=1
)
In [129]:
# Convert locality name to lowercase for safe matching
house_rent['localityName_clean'] = house_rent['localityName'].str.lower().str.strip()
# Corrected: Use lowercase strings in .str.contains()
house_rent.loc[house_rent['localityName_clean'].str.contains('rohini', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('pitampura', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('prashant', na=False), 'suburbName'] = 'North West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('chattarpur', na=False), 'suburbName'] = 'South Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('dwarka', na=False), 'suburbName'] = 'South West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('paschim vihar', na=False), 'suburbName'] = 'West Delhi'
house_rent.loc[house_rent['localityName_clean'].str.contains('punjabhi bagh', na=False), 'suburbName'] = 'West Delhi'
In [130]:
# this is the corrected table of mean price by suburb area
house_rent.groupby('suburbName')['price'].mean().sort_values()
Out[130]:
suburbName
East Delhi          16853.991714
West Delhi          25484.500220
South West Delhi    26965.963132
North West Delhi    27532.220395
North Delhi         28306.407756
Delhi Central       35544.416035
South Delhi         46637.124554
Other               60501.648364
Name: price, dtype: float64

🧹 2. Data Cleaning¶

In [131]:
# Using boxplot to detect outliers
plt.figure(figsize=(15, 6))  
fig = sns.boxplot(data=house_rent)

plt.xticks(rotation=45) 
plt.title("Boxplot of House Rent Dataset")
plt.show()
No description has been provided for this image

The price feature contained outliers, which were handled using the Quantile-based method for better distribution and model performance.

In [132]:
q1 = house_rent['price'].quantile(0.25)
q3 = house_rent['price'].quantile(0.75)

IQR = q3-q1

house_rent = house_rent[(house_rent['price']>=q1-1.5*IQR)&
                        (house_rent['price']<=q3+1.5*IQR)]
In [133]:
plt.figure(figsize=(15, 6))  # Increase width
fig = sns.boxplot(data=house_rent)

plt.xticks(rotation=45)  # Rotate x-axis labels for better visibility
plt.title("Boxplot of House Rent Dataset")
plt.show()
No description has been provided for this image

Although some outliers still exist, the chart clearly shows that most price values fall below ₹55,000, so we manually removed extreme values beyond this threshold for cleaner analysis.

In [134]:
house_rent = house_rent[house_rent["price"]<=55000]
In [135]:
fig = sns.boxplot(house_rent["price"])
fig
Out[135]:
<Axes: ylabel='price'>
No description has been provided for this image
In [136]:
house_rent = house_rent[house_rent["size_sq_ft"]<=2400]
In [137]:
fig = sns.boxplot(house_rent["size_sq_ft"])
fig
Out[137]:
<Axes: ylabel='size_sq_ft'>
No description has been provided for this image

Since the dataset focuses on rental prices in Delhi, we filtered the data to include only entries within the geographic boundaries of Delhi using latitude and longitude ranges.

In [138]:
house_rent = house_rent[house_rent["longitude"] <= 77.39103]
house_rent = house_rent[house_rent["longitude"] >= 76.95978]
In [139]:
le = LabelEncoder()
house_rent['locality_encoded'] = le.fit_transform(house_rent['localityName_clean'])
In [140]:
locality_price_mean = house_rent.groupby('localityName_clean')['price'].mean()
house_rent['locality_encoded'] = house_rent['localityName_clean'].map(locality_price_mean)
In [141]:
house_rent['suburbName_clean'] = house_rent['suburbName'].str.lower().str.strip()
In [142]:
# drop unneccessary features
house_rent = house_rent.drop("Column1",axis=1)
house_rent = house_rent.drop("suburbName",axis=1)
house_rent = house_rent.drop("localityName",axis=1)
house_rent = house_rent.drop("companyName",axis=1)
house_rent = house_rent.drop("closest_mtero_station_km",axis=1)
house_rent = house_rent.drop("Aiims_dist_km",axis=1)
house_rent = house_rent.drop("NDRLW_dist_km",axis=1)
In [143]:
house_rent.head()
Out[143]:
size_sq_ft propertyType bedrooms latitude longitude cityName price AP_dist_km localityName_clean locality_encoded suburbName_clean
0 400 Independent Floor 1 28.641010 77.284386 Delhi 9000 21.741188 swasthya vihar 22251.162791 east delhi
1 1050 Apartment 2 28.594969 77.298668 Delhi 20000 21.401856 mayur vihar phase 1 14065.000000 east delhi
2 2250 Independent Floor 2 28.641806 77.293922 Delhi 28000 22.620365 swasthya vihar 22251.162791 east delhi
3 1350 Independent Floor 2 28.644363 77.293228 Delhi 28000 22.681201 krishna nagar 20858.333333 east delhi
4 450 Apartment 2 28.594736 77.311150 Delhi 12500 22.592810 new ashok nagar 9664.009112 east delhi
In [144]:
housing_num = house_rent.drop(["propertyType","suburbName_clean",'localityName_clean','cityName'] ,axis=1)
In [145]:
original_count = 17890
cleaned_count = len(house_rent)

rows_dropped = original_count - cleaned_count
percentage_lost = (rows_dropped / original_count) * 100

print(f"Rows dropped: {rows_dropped} ({percentage_lost:.2f}%)")
Rows dropped: 1986 (11.10%)

🧹 Data Cleaning Summary¶

We lost approximately 11.10% of our data during the cleaning process due to outlier removal, missing values, and irrelevant entries.
⚠️ This step was crucial to ensure data quality and model reliability.

Looking for Correlations¶

In [146]:
corr_matrix = housing_num.corr()
plt.figure(figsize=(10, 6))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)

plt.title("Correlation Heatmap")
plt.show()
No description has been provided for this image
In [147]:
corr_matrix["price"].sort_values(ascending=False)
Out[147]:
price               1.000000
size_sq_ft          0.766358
locality_encoded    0.669048
bedrooms            0.665165
latitude            0.016233
longitude          -0.162069
AP_dist_km         -0.182466
Name: price, dtype: float64

Discover and Visualize the Data to Gain Insights¶

In [148]:
#  Overall Feature Distributions
%matplotlib inline
house_rent.hist(bins=50, figsize=(20,15))
plt.show()
No description has been provided for this image
In [149]:
# Price Distribution of Houses
import plotly.io as pio

fig = px.histogram(house_rent ,x = "price")

fig.update_layout(
    title = "Price Range",
    title_x = 0.5,
    xaxis_title = "Price",
    yaxis_title  = "No. of House",
    font = dict(size= 16)
)

fig.update_traces (
    hovertemplate ='<b>Price range of house:</b> %{x}<br><b>No.of house:</b> %{y}'
)
fig.show()
In [190]:
# Distribution of Property Types
import plotly.io as pio
propertyType = house_rent["propertyType"].value_counts().reset_index()
propertyType.columns = ['Property Type', 'count']

fig = px.bar(
    propertyType, 
    x='Property Type',
    y='count' ,
    color_discrete_sequence=['orange'], 
    title="Property Distribution"
)

fig.update_layout(
    title_x=0.5,  # Center title
    xaxis_title="Property Type",
    yaxis_title='count',
    font = dict(size= 16),
    height=500, 
    width=1200
)

fig.show()
In [189]:
#  Suburban Property Distribution
import plotly.io as pio
propertyType = house_rent["suburbName_clean"].value_counts().reset_index()
propertyType.columns = ['Suburban Name', 'count']

fig = px.pie(
    propertyType, 
    names='Suburban Name', 
    values='count', 
    color_discrete_sequence=px.colors.qualitative.Vivid, 
    title="Property Distribution",
)

fig.update_layout(
    title_x=0.5,  # Center title
    title_font=dict(size=24),
    height=500, 
    width=1200
)

fig.show()
In [152]:
#  Geographic Distribution of Rental Listings
import plotly.io as pio
pio.renderers.default = "notebook"
fig1 = px.scatter(house_rent,x="longitude", y="latitude")
fig1.show()
In [153]:
#  Rental Property Distribution on Delhi Map
import matplotlib.image as mpimg

Delhi_img = mpimg.imread("Delhi.jpg")  

ax = house_rent.plot(kind="scatter", x="longitude", y="latitude",
                      colorbar=False, alpha=0.5)

plt.gca().set_facecolor('black')

plt.imshow(Delhi_img, extent=[76.85, 77.41, 28.41, 28.9], alpha=0.6)

plt.show()
No description has been provided for this image
In [154]:
# Distribution of House Sizes (in sq ft)
import plotly.io as pio
fig = px.histogram(
    house_rent, 
    x="size_sq_ft", 
    color_discrete_sequence=['yellow'],
)

fig.update_layout(
    title="Distribution of Size (sq ft)",
    xaxis_title="Size (sq ft)",
    yaxis_title="Count",
    title_x=0.5  # Center the title
)

fig.show()
In [155]:
# Create a price category column for stratification
house_rent["size_cat"] = pd.cut(house_rent["size_sq_ft"],
                               bins=[0,500,1000,1500,2000,np.inf],
                               labels=[1,2,3,4,5])
In [156]:
# House Size Category Distribution
import plotly.io as pio
fig = px.histogram(
    house_rent, 
    x="size_cat", 
    color="size_cat",  # Ensure color categories are applied correctly
    color_discrete_sequence=["red", "blue", "green", "purple", "orange"]
)

fig.show()
In [157]:
# We drop longitude and latitude feature more convenience
house_rent = house_rent.drop('latitude',axis=1)
house_rent = house_rent.drop('longitude',axis=1)
In [158]:
house_rent = house_rent.reset_index(drop=True)

🤖 4. Feature Selection & Modeling¶

Create a Test Set¶

In [159]:
# Perform stratified train-test split
split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(house_rent, house_rent["size_cat"]):
    train_set = house_rent.loc[train_index]
    test_set = house_rent.loc[test_index]

print("Training set size:", train_set.shape)
print("Testing set size:", test_set.shape)
Training set size: (12723, 10)
Testing set size: (3181, 10)
In [160]:
train_set.head()
Out[160]:
size_sq_ft propertyType bedrooms cityName price AP_dist_km localityName_clean locality_encoded suburbName_clean size_cat
15292 800 Apartment 2 Delhi 12000 17.273620 sri niwaspuri 15000.000000 other 2
15089 1000 Apartment 2 Delhi 22000 16.885609 hari nagar ashram 26500.000000 other 2
1827 1000 Independent Floor 1 Delhi 15000 21.707782 mayur vihar 19485.046729 east delhi 2
3663 1200 Independent Floor 3 Delhi 17000 11.820051 chattarpur 15355.663825 south delhi 3
5897 1400 Apartment 2 Delhi 22000 12.627945 paschim vihar 25747.144457 west delhi 3
In [161]:
test_set.head()
Out[161]:
size_sq_ft propertyType bedrooms cityName price AP_dist_km localityName_clean locality_encoded suburbName_clean size_cat
11091 654 Independent Floor 1 Delhi 13000 13.278257 patel nagar 19473.086620 delhi central 2
6700 1800 Independent Floor 3 Delhi 40000 13.186740 paschim vihar 25747.144457 west delhi 4
10251 1050 Independent Floor 2 Delhi 10500 4.129440 palam 21148.214286 south west delhi 3
10886 485 Independent Floor 1 Delhi 16500 12.665641 patel nagar 19473.086620 delhi central 1
13443 1050 Independent Floor 2 Delhi 19000 17.256704 sector-7 rohini 18853.846154 north west delhi 3
In [162]:
house_rent["size_cat"].value_counts(normalize=True)
Out[162]:
size_cat
2    0.450578
1    0.194982
3    0.172032
4    0.158011
5    0.024396
Name: proportion, dtype: float64
In [163]:
train_set["size_cat"].value_counts(normalize=True)
Out[163]:
size_cat
2    0.450601
1    0.195001
3    0.172051
4    0.157982
5    0.024365
Name: proportion, dtype: float64
In [164]:
test_set["size_cat"].value_counts(normalize=True)
Out[164]:
size_cat
2    0.450487
1    0.194907
3    0.171959
4    0.158126
5    0.024521
Name: proportion, dtype: float64
In [165]:
for set_ in (train_set,test_set):
    set_.drop("size_cat",axis=1,inplace=True)

Prepare the Data for Machine Learning Algorithms¶

In [166]:
# Prepare training data
X_train = train_set.drop("price", axis=1)
y_train = train_set["price"]
In [167]:
num_attribs = list(X_train.select_dtypes(include=["number"]))
cat_attribs = ["propertyType","suburbName_clean" ]

# Create column transformer
full_pipeline = ColumnTransformer([
    ("num", StandardScaler(), num_attribs),
    ("cat", OneHotEncoder(handle_unknown='ignore', sparse_output=False), cat_attribs)
])
In [168]:
X_train.shape
Out[168]:
(12723, 8)
In [169]:
X_train_prepared = full_pipeline.fit_transform(X_train)
In [170]:
X_test = test_set.drop("price", axis=1)
y_test = test_set["price"]
In [171]:
X_test_prepared = full_pipeline.transform(X_test)
In [172]:
X_train_prepared
Out[172]:
array([[-0.37198282, -0.01166857,  0.65636707, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03138368, -0.01166857,  0.58602418, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03138368, -1.19934185,  1.46024139, ...,  0.        ,
         0.        ,  0.        ],
       ...,
       [-0.77534932, -1.19934185, -0.1261086 , ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03138368, -0.01166857,  1.00229333, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.03138368, -0.01166857, -0.18139006, ...,  1.        ,
         0.        ,  0.        ]])
In [173]:
from sklearn.ensemble import RandomForestRegressor

Tuning RandomForestRegressor with RandomizedSearchCV¶

In [174]:
rf = RandomForestRegressor(random_state=42)

param_dist = {
    'n_estimators': [100, 200, 300, 400],
    'max_depth': [10, 20, 30, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2']
}

random_search = RandomizedSearchCV(
    estimator=rf,
    param_distributions=param_dist,
    n_iter=30,           
    cv=5,                
    verbose=2,
    n_jobs=-1,
    scoring='r2',
    random_state=42
)

random_search.fit(X_train_prepared, y_train)

print("Best Params:", random_search.best_params_)

best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test_prepared)
print("MAE:", mean_absolute_error(y_test, y_pred))
print("R²:", r2_score(y_test, y_pred))
Fitting 5 folds for each of 30 candidates, totalling 150 fits
Best Params: {'n_estimators': 400, 'min_samples_split': 2, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': None}
MAE: 3369.0222959375624
R²: 0.8218728583717225

Try different models¶

In [ ]:
# Randomforest regressor
model = RandomForestRegressor(n_estimators=  400, min_samples_split= 2, min_samples_leaf =2, max_features= 'log2')
model.fit(X_train_prepared, y_train)

y_pred = model.predict(X_test_prepared)

mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error: {mae:.2f}")
print(f"R² Score: {r2:.2f}")
Mean Absolute Error: 3372.42
R² Score: 0.82
In [176]:
from xgboost import XGBRegressor

xgb_model = XGBRegressor(n_estimators=150, max_depth=10, learning_rate=0.1, random_state=42)
xgb_model.fit(X_train_prepared, y_train)

y_pred_xgb = xgb_model.predict(X_test_prepared)

print("XGBoost MAE:", mean_absolute_error(y_test, y_pred_xgb))
print("XGBoost R²:", r2_score(y_test, y_pred_xgb))
XGBoost MAE: 3420.8210403168964
XGBoost R²: 0.8099862379060654
In [177]:
from lightgbm import LGBMRegressor

lgb_model = LGBMRegressor(n_estimators=150, max_depth=10, learning_rate=0.1, random_state=42)
lgb_model.fit(X_train_prepared, y_train)

y_pred_lgb = lgb_model.predict(X_test_prepared)

print("LightGBM MAE:", mean_absolute_error(y_test, y_pred_lgb))
print("LightGBM R²:", r2_score(y_test, y_pred_lgb))
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000820 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 737
[LightGBM] [Info] Number of data points in the train set: 12723, number of used features: 15
[LightGBM] [Info] Start training from score 21841.225026
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
LightGBM MAE: 3420.458390342412
LightGBM R²: 0.8207300652203096
In [178]:
from catboost import CatBoostRegressor

cat_model = CatBoostRegressor(iterations=300, depth=10, learning_rate=0.1, verbose=0)
cat_model.fit(X_train_prepared, y_train)

y_pred_cat = cat_model.predict(X_test_prepared)

print("CatBoost MAE:", mean_absolute_error(y_test, y_pred_cat))
print("CatBoost R²:", r2_score(y_test, y_pred_cat))
CatBoost MAE: 3398.996688811509
CatBoost R²: 0.8213430526692512

📈 5. Model Evaluation Summary¶

🌲 Random Forest Regressor achieved the highest performance with an impressive R² score of 82% on the test set.¶

📦 This model has been saved for future rent price prediction tasks.¶

In [179]:
my_model =model
In [180]:
import joblib

# Save the model
joblib.dump(my_model, "house_price_model.pkl")

print("Model saved successfully!")
Model saved successfully!
In [181]:
my_model_loaded = joblib.load("house_price_model.pkl") # DIFF
In [182]:
# let's try the full preprocessing pipeline on a few training instances
some_data = house_rent.iloc[:5]
some_labels = y_train.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)

print("Predictions:", model.predict(some_data_prepared))
Predictions: [ 9697.23877809 23381.70283698 33055.84970388 27836.59551783
 11366.20807406]
In [183]:
print("Labels:", list(some_labels))
Labels: [12000, 22000, 15000, 17000, 22000]
In [184]:
# Create the mapping from locality name to its encoded value
locality_encoding_map = house_rent.groupby("localityName_clean")["locality_encoded"].first().to_dict()
In [185]:
# This function converts the user's locality name into 
# its corresponding encoded value (mean price), preparing 
# the input for the rent prediction model.

def prepare_input_with_locality_name(user_input_dict):
    # Convert locality name to lowercase
    loc_clean = user_input_dict["localityName"].lower().strip()


    encoded_val = locality_encoding_map.get(loc_clean)

    if encoded_val is None:
        raise ValueError(f"❌ Unknown locality: {loc_clean}. Please check spelling.")

    user_input_dict["locality_encoded"] = encoded_val
    user_input_dict.pop("localityName")  

    return pd.DataFrame([user_input_dict])

📉 6. Prediction¶

In [186]:
user_input = {
    "size_sq_ft":500,
    "propertyType": "independent floor",
    "bedrooms": 2,
    "localityName": "geeta colony",
    "suburbName_clean": "east delhi",
    "AP_dist_km": 20,

}
new_data = prepare_input_with_locality_name(user_input)
new_data_prepared = full_pipeline.transform(new_data)
predicted_price = model.predict(new_data_prepared)

print(f"🏠 Predicted Rent: ₹{predicted_price[0]:,.2f}")
🏠 Predicted Rent: ₹10,997.24